I will attempt to build a machine learning model that predicts the winner of an NBA game as accurately as possible. I will treat this first as a binary classification problem (win/loss) and then as a regression problem, predicting the net score and the total score. I will run several classification models with varying parameters to see which generates the best predictions, including logistic regression, ridge classification, random forest, and k-nearest neighbors.
I have gathered data from a few different sources to compile the following for games going back to the 2003 season:
Per-season statistics for each team via stats.nba.com
Per-season advanced statistics for each team via stats.nba.com
Additional advanced statistics (RAPTOR and WAR) via FiveThirtyEight
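For reference, here is a rough sketch of how per-season stats can be joined onto a game log, once for the home team and once for the away team. The frames and column names below are toy stand-ins, not the real files:

```python
import pandas as pd

# Toy stand-ins for the real sources; actual columns differ.
games = pd.DataFrame({'season': [2003, 2003],
                      'home_team': ['BOS', 'LAL'],
                      'away_team': ['NYK', 'SAS']})
season_stats = pd.DataFrame({'season': [2003] * 4,
                             'team': ['BOS', 'LAL', 'NYK', 'SAS'],
                             'off_rtg': [104.1, 108.3, 101.5, 105.2]})

# Prefix the stat columns, then join once per side of the matchup.
home = season_stats.add_prefix('home_').rename(columns={'home_season': 'season'})
away = season_stats.add_prefix('away_').rename(columns={'away_season': 'season'})
merged = (games.merge(home, on=['season', 'home_team'])
               .merge(away, on=['season', 'away_team']))
print(merged[['home_team', 'home_off_rtg', 'away_team', 'away_off_rtg']])
```

Repeating the merge with the advanced and FiveThirtyEight tables builds up the full feature set.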
# imports to get started
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# load the data and check out the head
games = pd.read_csv('games.csv')
games.head()
games.info()
games.describe()
# get a sense of which variables might be the best predictors
games.corr().home_team_wins.sort_values(ascending=False)
Some of these features will be TOO good at predicting the game outcome, because they ARE the game outcome. But we will want to save those for later as the regression targets.
games['total_pts'] = games['pts_away'] + games['pts_home']
total_pts = games['total_pts']
net_pts = games['home_net_pts']
games.head(1)
# check for nulls
games.isnull().sum()
# get rid of rows with nulls since there are so few
games.dropna(inplace=True)
games.info()
# set x and y values
features = ['home_days_rest',
'away_days_rest', 'home_win_pct', 'away_win_pct', 'home_3pm',
'home_3pa', 'home_3p_pct', 'home_ftm', 'home_fta', 'home_ft_pct',
'away_3pm', 'away_3pa', 'away_3p_pct', 'away_ftm', 'away_fta',
'away_ft_pct', 'home_raptor', 'away_raptor', 'home_war', 'away_war',
'home_off_rtg', 'home_def_rtg', 'home_net_rtg', 'home_ast_pct',
'home_ast_to', 'home_ast_ratio', 'home_oreb_pct', 'home_dreb_pct',
'home_reb_pct', 'home_tov_pct', 'home_efg_pct', 'away_off_rtg',
'away_def_rtg', 'away_net_rtg', 'away_ast_pct', 'away_ast_to',
'away_ast_ratio', 'away_oreb_pct', 'away_dreb_pct', 'away_reb_pct',
'away_tov_pct', 'away_efg_pct']
target = ['home_team_wins']
Let's check out some of the best teams based on the different categories.
for col in games[['home_3pm','home_3p_pct','home_off_rtg','home_raptor','home_win_pct']]:
    print(games.groupby(['home_team_name','season']).max()[col].sort_values(ascending=False).head(5))
    print('\n')
print(games.groupby(['home_team_name','season']).min()['home_def_rtg'].sort_values().head(5))
import plotly.express as px
px.histogram(games,x='home_3pm')
px.histogram(games,x='home_raptor')
px.histogram(games,x='home_net_rtg')
from sklearn.metrics import classification_report, confusion_matrix
def run_tests(y, pred):
    print(classification_report(y, pred))
    print(confusion_matrix(y, pred))
Let's get a baseline model using only the winning percentage of each team, and try to beat its score.
y = games[target]
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
lr = LogisticRegression()
pred_base = cross_val_predict(lr,games[['home_win_pct','away_win_pct']],y.values.ravel(),cv=5)
run_tests(games[target],pred_base)
First, let's run a logistic regression using all of the features.
pred_all = cross_val_predict(lr,games[features],games[target].values.ravel(),cv=5)
run_tests(y,pred_all)
Slightly worse, actually. Let's try normalizing the predictive features before running the model, using the z-score method.
# normalize the features before predicting
from sklearn.preprocessing import StandardScaler
features_scaled = StandardScaler().fit_transform(games[features])
lr2 = LogisticRegression(max_iter=10000)
pred_scaled = cross_val_predict(lr2,features_scaled,y.values.ravel(),cv=5)
run_tests(y,pred_scaled)
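One caveat with this approach: fitting the scaler on the full dataset before cross-validation lets each fold "see" the held-out data's means and variances. Wrapping the scaler and model in a Pipeline refits the scaler inside each training fold. A sketch on synthetic data standing in for `games[features]`:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # stand-in for games[features]
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # stand-in binary target

# The scaler is re-fit on each training fold, then applied to its held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
pred = cross_val_predict(pipe, X, y, cv=5)
print('accuracy:', (pred == y).mean())
```

With season-level aggregate features the leakage is probably small, but the Pipeline version costs nothing extra.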
Hardly any better than the baseline model. Let's try ridge regularization to shrink the influence of correlated, competing predictive features.
# Ridge Regression
from sklearn.linear_model import RidgeClassifier
ridge = RidgeClassifier()
pred_ridge = cross_val_predict(ridge,games[features],y.values.ravel(),cv=5)
run_tests(y,pred_ridge)
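Worth noting that a ridge (L2) penalty shrinks coefficients toward zero but rarely eliminates them; if the goal is actually dropping redundant features, an L1 (lasso-style) penalty zeroes some coefficients outright. A minimal sketch on synthetic data, not the games dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
# Only the first two columns actually drive the outcome; the rest are noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)
# L1 penalty with a fairly strong C drives many noise coefficients to exactly zero.
lasso_lr = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
lasso_lr.fit(X_scaled, y)
print(np.sum(lasso_lr.coef_ != 0), 'non-zero coefficients out of', X.shape[1])
```

The surviving non-zero coefficients point at the features the model actually uses.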
Now let's try other classification models, Random Forest and K Nearest Neighbors, and see what happens.
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_scaled,y.values.ravel(),test_size=0.3,random_state=22)
rf = RandomForestClassifier()
param_grid = {'n_estimators': [10,25,50,100,500,1000]}
grid = GridSearchCV(rf,param_grid,verbose=1)
grid.fit(X_train,y_train)
grid.best_params_
pred_rf = grid.predict(X_test)
run_tests(y_test,pred_rf)
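After fitting a forest, it can also be useful to inspect which inputs it actually leans on via `feature_importances_`. A small sketch on synthetic data with made-up feature names (the real call would use the fitted grid's `best_estimator_` and the `features` list):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
feature_names = ['home_win_pct', 'away_win_pct', 'home_net_rtg', 'noise']
X = rng.normal(size=(300, 4))
# The outcome depends on the first two columns; the last is pure noise.
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=22).fit(X, y)
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f'{name}: {imp:.3f}')
```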
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=2)
pred_knn = cross_val_predict(knn,features_scaled,y.values.ravel(),cv=5)
run_tests(y,pred_knn)
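A side note on the KNN run: `n_neighbors=2` is an unusual choice for binary classification, since an even k invites ties. A small grid search over odd k, sketched on synthetic data, is a cheap way to pick it instead:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# Odd k avoids ties between the two classes; search a small grid.
grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': [3, 5, 11, 25, 51]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```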
Those were much worse. The moral of the story, at least with the data I have available, is that you won't get much more accurate in picking winners than simply choosing the team with the higher winning percentage. There is too much variability in game outcomes.
That being said, let's see if we can use Linear Regression to predict the net score (to be compared to the Vegas odds) and the total score of the game (to be compared to the over/under).
total_pts = games['total_pts']
net_pts = games['home_net_pts']
px.histogram(games,x='total_pts')
px.histogram(games,x='home_net_pts')
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import sklearn.metrics as metrics
def lin_test(y, pred):
    # return the RMSE so the callers below can print it
    return np.sqrt(metrics.mean_squared_error(y, pred))
pred_lin_total = cross_val_predict(LinearRegression(),features_scaled,total_pts,cv=5)
pred_lin_net = cross_val_predict(LinearRegression(),features_scaled,net_pts,cv=5)
print('RMSE Total Points: ',lin_test(total_pts,pred_lin_total))
print('RMSE Net Points: ',lin_test(net_pts,pred_lin_net))
How does that compare to the standard deviations?
print('Total points STD: ',np.std(games['total_pts']))
print('Net Points STD: ',np.std(games['home_net_pts']))
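Another way to frame this comparison: when a model's RMSE equals the target's standard deviation, the model is doing no better than always predicting the mean, which corresponds to an R² of about zero. A deterministic toy illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy total-point targets; predicting the mean every time gives
# RMSE == population standard deviation and R² == 0.
y = np.array([190., 210., 205., 220., 198.])
pred_mean = np.full_like(y, y.mean())
rmse = np.sqrt(np.mean((y - pred_mean) ** 2))
print(round(rmse, 3), round(np.std(y), 3), r2_score(y, pred_mean))
```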
Let's try various polynomial features, and a ridge regression.
games_poly2 = PolynomialFeatures(degree=2,interaction_only=True).fit_transform(features_scaled)
games_poly3 = PolynomialFeatures(degree=3,interaction_only=True).fit_transform(features_scaled)
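It's worth noting how quickly interaction-only polynomial features grow. With the 42 base features above and the default `include_bias=True`, a quick count:

```python
from math import comb
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n = 42  # number of base features in the list above
# interaction_only=True keeps the bias, the linear terms, and the
# pairwise (and, for degree 3, three-way) interaction products.
deg2 = 1 + n + comb(n, 2)
deg3 = deg2 + comb(n, 3)
print(deg2, deg3)

# Sanity check the degree-2 count against sklearn on a dummy row.
X = np.zeros((1, n))
assert PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X).shape[1] == deg2
```

At degree 3 the design matrix has over 12,000 columns, so the ridge penalty is doing a lot of work.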
from sklearn.linear_model import Ridge
pred_total_poly2 = cross_val_predict(Ridge(),games_poly2,total_pts,cv=5)
pred_total_poly3 = cross_val_predict(Ridge(),games_poly3,total_pts,cv=5)
pred_net_poly2 = cross_val_predict(Ridge(),games_poly2,net_pts,cv=5)
pred_net_poly3 = cross_val_predict(Ridge(),games_poly3,net_pts,cv=5)
print('RMSE Total Points: ',lin_test(total_pts,pred_total_poly2))
print('RMSE Net Points: ',lin_test(net_pts,pred_net_poly2))
print('RMSE Total Points: ',lin_test(total_pts,pred_total_poly3))
print('RMSE Net Points: ',lin_test(net_pts,pred_net_poly3))
So the simplest model turned out to be the best, although it is not very predictive, with an RMSE of 18 compared to a standard deviation of about 18 points.
At the end of the day, I would not base gambling decisions on this model unless the spread or over/under differs significantly from the model's prediction.